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ABSTRACT 

We  discuss  some  aspects  of  implementing  the  finite  element  method  on  parallel 
computers  with  local  memory  and  message  passing.  In  particular,  we  compare  the 
costs  of  using  high  order  and  low  order  elements,  and  of  direct  and  iterative  solvers 
for  solving  the  linear  systems  that  occur.  Our  model  of  parallel  computation  is  a 
two-dimensional  grid  of  processors  chosen  to  be  similar  in  shape  to  the  underlying 
grid.  Our  main  conclusions  are  that  sparse  direct  solvers  generalize  naturally  to 
methods  based  on  high  order  elements,  and  that  direct  solvers  are  adequate  for 
two-dimensional  problems,  especially  for  multiple  load  vectors.  We  also  demonstrate 
that  high  order  methods  achieve  higher  accuracy  at  less  cost  on  some  typical  model 
problems. 
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Abstract.  We  discuss  some  aspects  of  implementing  the  finite  element  method  on 
parallel  computers  with  local  memory  and  message  passing.  In  particular,  we  compare 
the  costs  of  using  high  order  and  low  order  elements,  and  of  direct  and  iterative  solvers 
for  solving  the  linear  systems  that  occur.  Our  model  of  parallel  computation  is  a  two- 
dimensional  grid  of  processors  chosen  to  be  similar  in  shape  to  the  underlying  grid.  Our 
main  conclusions  are  that  sparse  direct  solvers  generalize  naturally  to  methods  based  on 
high  order  elements,  and  that  direct  solvers  are  adequate  for  two-dimensional  problems, 
especially  for  multiple  load  vectors.  We  also  demonstrate  that  high  order  methods  achieve 
higher  accuracy  at  less  cost  on  some  typical  model  problems. 

1.  Introduction 

The  finite  element  method  is  a  major  computational  tool  of  engineering.  However, 
large  problems,  especially  in  three  dimensions,  are  not  practically  solvable  on  present- 
day  serial  computers.  Improvements  in  computing  technology,  based  on  the  emergence  of 
large-scale  parallel  architectures,  offer  the  possibility  of  expanding  the  set  of  problems  that 
can  be  solved  effectively.  Such  machines  appear  particularly  attractive  for  finite  element 
algorithms,  in  which  the  local  elements  provide  a  natural  decomposition  of  the  problem 
into  (partially)  independent  sets.  The  efficiency  of  any  numerical  algorithm  implemented 
on  a  parallel  computer  depends  on  both  arithmetic  and  communication,  and  many  aspects 
of  finite  element  computations,  most  notably  those  involving  linear  algebra,  are  not  fully 
parallel.  In  this  paper,  we  address  the  issues  associated  with  finite  element  analysis  on 
parallel  computers  in  a  simplified  way,  hoping  to  get  insight  into  how  to  use  such  machines 
effectively.  We  focus  on  two  basic  issues: 

1.  a  comparison  between  high  order  and  low  order  basis  functions; 

2.  a  comparison  between  direct  methods  and  iterative  methods  for  solving  the  linear 
algebraic  systems  that  arise. 

We  consider  three  versions  of  the  finite  element  method:  the  standard  h- version,  which 
uses  low  order  basis  functions  and  achieves  accuracy  by  refining  the  mesh;  the  p-version, 
which  uses  a  fixed  mesh  and  achieves  accuracy  by  using  high  order  basis  functions;  and 
the  bp-version,  which  combines  these  two  approaches.  See  [2j  for  a  survey  of  results  for 
the  latter  two  methods.  All  these  techniques  produce  a  set  of  one  or  more  linear  systems 
of  equations  where  the  coefficient  matrix  is  the  global  stiffness  matrix.  Our  strategy  for , 
solving  these  systems  in  parallel  is  to  partition  the  problem  among  the  available  processors 
and  apply  local  elimination  inside  each  processor,  sc  that  unknowns  “interior”  to  processors 
are  decoupled  from  the  system.  For  computing  the  other  unknowns,  we  examine  both  direct 
solvers  based  on  the  parallel  nested  dissection  method  (5,6,17)  and  parallel  versions  of  the 
conjugate  gradient  method  (CG).  The  latter  strategy  is  related  to  domain  decomposition 
methods,  for  which  parallel  implementations  using  “fast  direct”  local  elimination  have  been 
considered  in  [9,10j. 

For  all  these  techniques,  we  perform  an  analysis  of  the  computational  complexity 
of  their  parallel  implementation,  and  we  perform  a  series  of  numerical  experiments  that 
determine  the  accuracy  of  finite  element  solutions  of  model  problems  for  various  choices 
of  basis  functions.  Combining  the  analytic  and  numerical  results  gives  estimates  of  the 


overall  parallel  costs.  Our  model  of  parallel  architecture  is  a  two-dimensional  k  x  k  grid  of 
processors  with  nearest  neighbor  connections.  Each  processor  has  access  to  its  own  local 
memory,  and  data  can  be  communicated  only  between  neighboring  processors. 

An  outline  of  the  paper  is  as  follows.  In  Section  2,  we  describe  the  model  problem  and 
the  finite  element  methods  used  for  discretization,  and  in  Section  3,  we  give  an  overview 
of  the  computations  needed  for  solution.  In  Section  4,  we  present  a  cost  analysis  of  the 
solution  techniques,  including  direct  local  elimination  and  global  elimination  by  both  direct 
methods  and  iterative  methods.  In  Section  5,  we  combine  these  results  with  the  results 
of  numerical  experiments  determining  the  effectiveness  of  various  finite  element  methods 
and  iterative  solvers  to  assess  the  overall  costs  of  solution  techniques,  and  in  Section  6  we 
draw  conclusions. 

2.  The  Model  Problem  and  Finite  Element  Discretizations 
Consider  the  model  problem 

~  2^  aika —  =  fionn,  t  =  l,...,s,  (2.1a) 

y,*=i  ax>  aXk 

dui 

—  =gi  on  50,  (2.16) 

where  fl  is  a  square  domain,  a3k  =  a*y  and  the  multiload  /*,  gi  satisfy  the  usual  solvability 
conditions.  We  will  cast  the  problem  into  the  standard  variational  form:  to  seek  u,  6 
Hl(Cl)  so  that 

B(ui,v)  =  Fj(v)  (2.2) 

holds  for  all  v  €  /f1(n),  where 


.  V"  f  du  dv  , 

(2.3a) 

F,(v)  =  /  fivdx+  /  gvds,  *  =  l,...,s. 

J  n  Jdu 

(2.36) 

The  solution  exists  and  is  unique  up  to  an  additive  constant.  We  pose  (2.1)  on  a  square 
domain,  but  our  arguments  apply  in  general  to  any  domain  that  is  topologically  a  square. 

The  finite  element  method  consists  of  the  selection  of  the  subspace  5  C  Hl(Q)  and 
computation  of  approximate  solutions  u,(5)  €  S  that  satisfy 

B(u,(5),v)  =  F,(v)  VveS.  (2.4) 

The  goal  is  to  obtain  «j(5)  such  that 


||u<(5)  -  Kills  <  T, 


(2.5) 


Figure  2.1:  The  standard  square  S. 


where  ||u||e  =  B( u,u)1//2  is  the  energy  norm  and  r  is  an  a  priori  given  tolerance.  One 
strategy  is  to  use  a  sequence  of  spaces  S^,l  =  1,2,...  and  compute  until  the  result 
satisfies  (2.5).  We  restrict  our  attention  to  the  energy  norm,  although  other  measures  are 
sometimes  more  important  in  practice. 

The  quality  of  the  finite  element  solution  u,-(5)  and  the  computational  work  to  com¬ 
pute  it  depend  on  the  regularity  of  u,  and  properties  of  S,  including  its  dimensionality, 
which  determines  the  size  of  the  linear  system  to  be  solved.  For  solving  (2.1)  on  a  k  x  k  grid 
of  processors,  we  will  divide  fl  into  fc2  quadrilateral  “super-elements”  Dt  ,,  1  <  <  k. 

Each  super-element  My  is  the  image  of  the  standard  square  S  =  [ — 1, 1  j  x  [-1,1]  with 
vertices  A<  and  sides  I\,  i  =  1,...,4,  see  Figure  2.1.  Let  A/,y  denote  the  mapping  from 
S  onto  My,  which  we  assume  is  smooth.  We  associate  with  S  a  finite  dimensional  space 
$  which  is  the  span  of  basis  functions  {<£}  =  U  U  Here  {<£(°)}  are 

nodal  shape  functions  associated  with  the  vertices  {A,},  are  side  shape  functions 

associated  with  the  sides  {I\}  and  {<^2^}  are  internal  shape  functions  associated  with  the 
interior  of  S.  A  nodal  shape  function  associated  with  A,  is  zero  on  the  opposite  sides  r,_i, 
r<+2,  a  side  shape  function  associated  with  T,  is  zero  on  Ty,  j  ^  t,  and  the  internal  shape 
functions  are  zero  on  Ul\.  These  functions  determine  a  basis  for  My  under  composition 
with  M~-1 .  The  basis  functions  are  chosen  so  that  S  C  Hl(Cl). 

Some  examples  of  basis  functions  are  as  follows:  ' 

(a)  Shape  functions  for  the  h-version.  $  is  divided  into  triangles  as  shown  in  Figure  2.2, 
left.  The  shape  functions  are  the  piecewise  linear  “hat  functions;”  some  examples  of  the 
supports  of  such  functions  are  shown  in  the  figure.  In  an  analogous  way,  we  can  divide  S 
into  squares  and  use  piecewise  bilinear  shape  functions  (Figure  2.2,  right). 

(b)  Shape  functions  for  the  p-  and  hp-versions.  For  the  p-versions,  there  are  4  nodal  shape 

functions  which  are  bilinear.  For  p  >  2  there  are  4(p—  1)  side  shape  functions.  For  example, 
for  the  side  p  =  —  1  the  shape  function  of  degree  j  in  $  is  =  (/f  l  lj{t)dt)(l-p)/2 

where  /y  is  the  Legendre  polynomial  of  order  j.  The  internal  shape  functions  are  given  by 
the  tensor  product  of  the  nodal  and  side  functions.  It  is  possible  to  restrict  the  number  of 
internal  shape  functions  to  |(p  -  2)(p  -  3)  for  p  >  4  (and  zero  otherwise),  so  that  the  span 
contains  the  complete  set  of  polynomials  of  degree  p.  This  set  is  used  in  the  commercial 
code  PROBE  [16,18].  Hence,  there  is  a  total  of  4 p+  (for  p  >  4)  ^(p  —  2)(p  —  3)  basis 
functions.  For  the  hp- version,  S  is  first  divided  into  squares,  and  the  p-versic.i  shape 


Figure  2.2:  Supports  of  some  shape  functions  for  the  h-version  with  triangular  and  square 
elements. 


functions  are  used  on  each  square. 

(c)  Shape  functions  for  spectral  methods  [13, 15).  On  every  side  of  S,  define  p+  1  Lobatto 
quadrature  points  and  polynomials  Oi(t)  such  that  £,(£,)  =  6,y,  the  Kronecker  6.  The 
shape  functions  of  $  are  then  the  tensor  product  functions  0,y(£,  q)  =  0,(f)0y(»7).  It  is 
straightforward  to  divide  these  functions  into  groups  of  nodal,  side  and  internal  shape 
functions. 

Using  the  basis  functions  of  <£,  we  can  easily  construct  basis  functions  of  S  from  those 
defined  on  { Dl} } .  Let  these  be  denoted  (0,  },  so  that  Ui(S)  =  Eysj-%,  F<(0y)  =  yj‘\ 
and  (2.4)  reduces  to  a  system  of  linear  equations 

GxW  =  y«, 

where  G  is  the  Gramm  matrix  ['y.y]  with  ^y  =  £(0i,0y).  The  unknowns  can  be 
visualized  as  located  in  the  nodal  points,  sides  and  interiors  of  Dij  as  in  Figure  2.3. 


x  Nodal 


□ 


9  Internal 


Figure  2.3:  Distribution  of  unknowns  on  super-elements. 


(2.6) 


3.  Overview  of  Computations 

We  would  like  to  compute  u,(S)  to  the  desired  accuracy  at  the  lowest  possible  cost,  in 
terms  of  work  and  storage.  Thus,  we  wish  to  understand  the  connection  between  the  error 


||u,(S)  -  us- 1| ^7  and  the  cost  of  computing  u,(S).  This  will  depend  on  the  choice  of  5  and 
its  basis,  the  regularity  of  the  solution  u,-,  the  method  of  linear  system  solution,  and  the 
computer  architecture.  In  this  section,  we  give  an  overview  of  the  required  computations. 
They  can  be  divided  conceptually  into  the  following  four  steps: 

(1)  Computation  of  local  stiffness  matrices  and  local  load  vector(s).  The  Gramm  matrix 
and  load  vectors  of  (2.6)  are  linear  combinations  of  local  stiffness  matrices  (or  local  load 
vectors),  defined  on  the  individual  super-elements  Dij.  The  local  stiffness  computations 
entail  quadratures  to  evaluate  B(u,v)  and  F\(v)  on  Dij.  We  will  not  consider  global 
assembly  of  G ,  but  instead  will  examine  local  eliminations  prior  to  assembly.  Hence,  these 
computations  are  fully  parallelizable. 

(2)  Local  elimination.  The  local  matrices  and  load  vectors  can  be  written  in  block  form 


A  B 
Bt  C 


)•  »• 


A  corresponds  to  the  connections  among  internal  unknowns  in  a  super-element,  except  for 
those  adjacent  to  the  boundary  of  f2,  where  A  also  includes  connections  among  all  side 
and  nodal  unknowns  not  associated  with  other  super-elements.  C  corresponds  to  connec¬ 
tions  among  unknowns  on  the  boundary  (i.e.  nodal  and  side)  D, y,  and  B  corresponds  to 
connections  between  interior  and  boundary  unknowns.  The  structure,  e.g.  sparsity,  of  A , 
B  and  C  depends  on  the  choice  of  shape  functions.  For  example,  the  h-ve rsion  produces 
sparse  matrices  and  the  p-version  leads  to  full  matrices.  For  any  methods,  the  interior 
unknowns  can  be  decoupled  from  those  on  the  domain  interfaces  by  computing  the  Schur 
complement 

C  4-  C  -  BtA~ lB,  (3.2) 

and  modifying  all  local  load  vectors  similarly  by 


c  *—  c  —  Br  A  lb.  (3.3) 

All  these  computations  are  fully  parallelizable.  For  the  h-version,  explicit  local  elimination 
is  performed  using  the  serial  nested  dissection  method  [7,8],  and  for  the  p-version,  dense 
elimination  is  used. 

(3)  Interface  solution.  We  define  the  global  interface  matrix  G  to  be  the  coefficient  matrix  of 
the  linear  system  for  the  unknowns  on  the  domain  interfaces  after  the  interior  unknowns  are 
decoupled  from  the  system.  Interface  load  vectors  g  are  defined  analogously.  G  is  computed 
explicitly  by  adding  components  of  the  local  stiffness  matrices  after  local  elimination  (3.2). 
The  result  is  one  or  more  systems  of  linear  equations  of  the  form 

Gv  =  g  (3.4) 

which  must  be  solved  for  the  interface  unknowns  v  corresponding  to  the  load  vector  g. 

(4)  Local  backsolves.  For  any  super-element  Dij,  let  v,y  denote  the  component  of  v  on  the 
boundary  of  Dij  determined  from  the  solution  of  (3.4).  The  interior  values  u,y  are  then 
obtained  by  solving  (in  parallel  for  all  super-elements)  the  local  system  Auty  =  b  —  Bvij. 


I 


(5)  Postprocessing.  After  determining  all  solutions  corresponding  to  different  load  vectors, 
it  is  usually  necessary  to  determine  the  values  of  interest.  This  is  done  on  an  interelement 
level,  and  we  do  not  consider  it  here. 

4.  Cost  Analysis. 

In  this  section,  we  give  a  detailed  cost  analysis  for  the  two-dimensional  model  prob¬ 
lem.  For  the  finite  element  method,  we  compare  the  b-version,  the  p-version,  and  the 
bp-version  as  discretizations  inside  super-elements,  with  m2  quadrilateral  elements  for  the 
b-  and  bp-versions.  For  solving  on  the  domain  interfaces,  we  compare  the  costs  of  direct 
solution  based  on  a  parallel  implementation  of  the  nested  dissection  method,  with  those 
of  iterative  solution  using  the  conjugate  gradient  method.  We  will  examine  CG  without 
preconditioning  (although  the  local  elimination  could  be  viewed  as  a  preconditioner),  and 
use  of  a  submatrix  of  the  interface  matrix  as  a  preconditioner.  We  will  also  distinguish 
between  the  cases  of  one  and  more  than  one  right  hand  side. 

In  our  analysis,  we  state  the  cost  of  arithmetic  in  terms  of  number  of  floating  point 
multiplications/divisions.  ( Additions /subtractions  are  essentially  in  one-to-one  correspon¬ 
dence  with  these.)  For  interprocessor  communication,  we  assume  that  a  processor  can 
communicate  with  all  of  its  neighbors  simultaneously,  but  at  any  given  moment,  data  can 
move  in  only  one  direction  between  two  processors.  Communication  costs  are  specified 
in  terms  of  “items  of  data.”  We  make  the  simplifying  assumption  that  blocks  of  data  of 
arbitrary  size  can  be  sent  (sizes  will  be  chosen  to  fit  algorithms),  but  each  send  entails  a 
startup  cost.  We  do  not  consider  overlapping  arithmetic  and  communication. 

4.1.  Local  Computations. 

The  fully  local  computations,  local  stiffness  computation  (step  1),  elimination  (step 
2),  and  backsolve  (step  4)  are  performed  simultaneously  in  every  processor.  For  simplicity, 
we  assume  that  the  same  discretization  is  used  in  every  super-element.  Let  D  denote 
one  super-element.  For  the  b-  and  bp-versions,  D  is  divided  into  an  m  x  m  grid  of  local 
quadrilateral  elements,  and  for  the  the  p-version,  no  subdivision  of  D  is  made. 

The  local  stiffness  matrix  and  load  computations  are  performed  by  quadratures  on  D. 
The  complexity  depends  on  the  choice  of  basis  functions,  but  it  also  depends  strongly  on  the 
cost  of  evaluating  the  coeficients  a;  *  and  loads  /j,  gi  of  (2.1),  and,  if  curved  quadrilaterals 
are  used,  the  Jacobian  Jm .  Since  these  costs  cannot  be  stated  too  precisely,  we  limit  our 
attention  to  orders  of  magnitude. 

(a)  The  h-veraion  with  bilinear  elements.  The  local  stiffness  matrix  is  sparse,  symmetric 
and  of  order  (m  +  l)2,  with  9  diagonals.  Each  entry  comes  from  0(1)  quadratures,  so 
a  total  of  Cjm2  operations  is  needed,  where  c\  is  strongly  dependent  on  the  coefficients 
and  Jacobian.  Similarly,  computation  of  the  load  vector  requires  c-im  operations  where  c 2 
depends  on  /  and  g.  We  estimate  that  reasonable  values  of  c\  and  c 2  are  50  -  100.  In 
special  cases  (such  as  constant  coefficients),  the  operation  count  reduces  to  essentially  zero 
because  the  same  stencil  appears  in  all  elements. 

(b)  The  p-  and  hp-versions.  For  the  p-version  (p  >  4),  D  is  not  subdivided,  and  the 
local  stiffness  matrix  is  full  of  order  Q  =  4 p+  (for  p  >  4)  j(p  —  2)(p  —  3)  .  Practical 
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values  of  p  are  about  6  —  10.  To  achieve  reasonable  accuracy  when  the  coefficients  aXJ 
or  Jacobian  is  not  close  to  a  constant,  Gauss  quadratures  with  (ap)2  quadrature  points 
are  used  where  a  >  1.  The  computation  of  the  one-dimensional  shape  functions  on  ap 
points  needs  c$p2  operations,  and  using  the  tensor  product  form  of  the  shape  functions, 
the  construction  of  the  local  stiff  matrix  needs  C4P4  operations,  where  c«  as  1  to  1.  In 
addition,  there  are  3p2  evaluations  of  coefficients  a,-;  and  Jacobians,  resulting  in  a  total 
cost  of  C4P3  +  C5P2  operations.  It  it  often  the  case  that  e$p 2  is  larger  than  the  first  term, 
so  a  reasonable  estimate  for  the  total  cost  is  cp4  where  c  as  1.  For  the  hp-version,  these 
costs  are  multiplied  by  m2,  except  in  the  special  case  of  constant  coefficients. 

The  local  eliminations  are  performed  by  some  version  of  Gaussian  elimination,  e.g. 
making  use  of  a  factorization  A  =  LLT .  It  is  possible  to  give  precise  specification  of  the 
costs.  For  the  p-method,  local  elimination  can  be  described  as  a  set  of  dense  block  matrix 
operations,  where  the  blocks  are  as  in  (3.1): 

Algorithm  1:  Local  elimination  and  backsolve. 

Elimination: 

a)  Cholesky  factorization  A  =  LLT, 

b)  Block  forward  solve  B  <—  L~lB , 

c)  Update  C  *—  C  —  BTB, 

d)  Forward  elimination  for  each  load  vector  b:  6  <—  L~lb  and  c  <—  c  -  BTb, 

Backsolve: 

e)  Given  boundary  values  v,  compute  interior  values  u  ♦—  L~T{b  —  Bv). 

Steps  (a)  -  (c)  are  independent  of  the  number  of  load  vectors;  steps  (d)  and  (e)  are  per¬ 
formed  for  each  load  vector.  Note  that  this  algorithm  must  be  combined  with  computation 
of  the  interface  unknowns,  which  takes  place  between  steps  (d)  and  (e). 

For  the  /i-version,  the  matrices  A ,  B  and  C  are  sparse,  and  we  consider  use  of  the 
nested  dissection  method,  in  which  the  rows  and  columns  of  A  are  symmetrically  permuted 
to  minimize  fill-in,  see  [7]  for  details.  (See  also  Section  4.2  for  a  parallel  version  used 
for  elimination  on  the  super-element  interfaces.)  Formally,  the  main  modification  of 
Algorithm  1  is  that  step  (a)  is  applied  to  PAPT ,  where  P  is  a  permutation  matrix  that 
implements  the  reordering.  For  the  hp-method,  Algorithm  1  is  applied  in  serial  to  each 
element  of  D  to  decouple  the  internal  unknowns  of  local-  elements  from  those  on  local 
interfaces,  and  then  it  is  applied  using  a  nested  dissection  ordering  to  eliminate  the  local 
interfaces. 

The  following  result  summarizes  the  floating  point  multiplication  counts  for  internal 
elimination.  A  proof  is  given  in  the  Appendix. 

Theorem  1:  The  high  order  multiplication  counts  of  the  local  elimination  used  for  two- 
dimensional  problems  are 

i(  ^6  3  5  17  .  341  /371  3  a  371  2  2  2  \ 

m  («P  +  l6P  +  TtP  "  IF”  )  +  (lFm  P  ~l2mp  -1Tmp  log2m) 

for  factorization,  and 
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for  forward-  and  back-substitutions. 

For  both  the  expressions  of  this  result,  the  first  term  in  the  sum  comes  from  treating  the 
internal  unknowns  on  each  of  m2  local  elements  in  D,  and  the  second  term  comes  from 
treating  local  element  interfaces  inside  D.  Note  that  the  factorization  is  performed  only 
once,  whereas  the  forward-  and  back-solves  are  performed  once  for  each  load  vector.  Also 
note  that  for  practical  values  of  p,  the  0(p 6)  term  is  not  the  governing  one. 

On  a  serial  computer,  all  computations  are  local  and  costs  are  summarized  the  fol¬ 
lowing  result. 

Theorem  2:  The  high  order  multiplication  counts  for  serial  factorization  are 

2/  1  0  ,  3  5  17  4  341  3\  ,  /  267  3  3  371  2  ,  2  j ,  \ 
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The  costs  are  slightly  lower  than  those  obtained  by  replacing  m  with  n  in  Theorem  1 
because  the  serial  algorithm  achieves  some  savings  near  the  boundaries  (see  [7]).  The 
proof  is  essentially  the  same  as  that  of  Theorem  1. 


4.2.  Parallel  Direct  Solution  for  Interfaces 

After  the  internal  unknowns  are  decoupled  from  the  system,  the  result  is  one  or  more 
systems  of  equations  of  the  form  (3.4)  whose  unknowns  are  associated  only  with  super¬ 
element  interfaces.  This  situation  is  depicted  in  Figure  4.1.  To  simplify  the  analysis,  we 
assume  that  the  number  k  of  processors  in  each  dimension  is  a  power  of  two.  If  D  denotes 
a  super-element  associated  with  some  processor,  let  d  denote  the  number  of  unknowns  on 
each  side  of  D,  including  one  nodal  unknown.  Thus,  each  (interior)  super-element  has  Ad 
unknowns  associated  with  it,  where  unknowns  lying  on  a  node  are  associated  with  four 
super-elements,  and  those  on  an  edge  are  associated  with  two.  Conceptually,  whether  these 
unknowns  come  from  the  h-,  p-  or  hp- versions  of  the  finite  element  method  is  irrelevant;  we 
simply  have  d  =  mp  for  all  methods.  The  order  of  G  is  then  N  —  (2d  —  l)fc2  +  2kd-\-l.  If  the 
unknowns  are  labeled  from  1  to  N,  then  an  entry  G,;  is  nonzero  iff  unknowns  t  and  j  are 
associated  with  a  common  subdomain.  It  has  been  shown  in  [5,6,17]  that  these  unknowns 
can  be  computed  with  0(1)  efficiency  using  the  parallel  nested  dissection  method.  In  this 
section,  we  give  a  high-level  description  of  this  algorithm  and  summarize  its  costs.  A 
detailed  description  is  given  in  the  Appendix. 

The  method  consists  of  log2  k  steps,  described  loosely  as  follows.  At  the  t’th  step,  a 
set  of  four  domains  from  step  t  —  1,  each  containing  4 6t  boundary  unknowns,  is  merged 
into  a  larger  domain  and  a  q  x  ict  processor  grid  is  used  to  decouple  the  interior 
unknowns  of  D W  from  the  boundary  unknowns.  (See  Figure  4.2  left.)  This  procedure  is 
repeated  recursively,  with  6t+i  =  2 St  and  **  =  2*.  For  example,  the  original  k 2  super- 
element  domains  {D^  1 1  <  t, j  <  k }  (one  per  processor)  each  contain  61  =  d  boundary 
unknowns  on  each  edge.  These  k 2  domains  are  grouped  into  (k/ 2) 2  square  sets  containing 
four  domains  each,  where  in  each  set  the  four  domains  are  contiguous  at  one  node.  Then, 
simultaneously  for  each  set,  the  four  domains  are  merged  into  a  larger  domain,  resulting 
in  a  set  of  (k/ 2)2  new  domains  {D^  |  1  <  i,j  <  k/ 2}  each  containing  26\  boundary 
unknowns  per  edge  and  residing  ona/ci  x  k  1  =  2x2  grid  of  processors. 
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Figure  4.1:  Configuration  of  unknowns  after  local  elimination. 


Algebraically,  “merging”  of  subdomains  means  assembling  four  stiffness  matrices  local 
to  the  subdomains  {£>(*" x)}  into  a  new  stiffness  matrix  for  D ^  (and  analogues  for  load 
vectors).  As  in  [5],  let  the  unknowns  for  the  merged  domain  D W  be  divided  into  pt  sets 
each  of  size  approximately  St ,  where  for  1  <  t  <  log2  k  —  2,  pt  =  12.  The  distribution 
of  unknowns  is  as  in  Figure  4.2,  left.  In  the  last  two  steps,  there  are  fewer  boundary 
unknowns  (or  none  at  all):  piog*-i  =  8  and  Piogfc  =  4.  In  this  discussion,  we  focus  on  the 
case  t  <  log2  k  —  2.  Let  the  interior  unknowns  be  labeled  with  the  integers  1  through  4,  and 
the  the  boundary  values  labeled  5  through  12.  The  stiffness  matrix  for  D W  is  then  a  block 
12  x  12  matrix,  denoted  by  5 W  (Figure  4.2,  right).  Contributions  to  each  of  the  blocks  of 
SM  come  from  the  parts  of  the  local  matrices  from  the  previous  step  whose  subdomains 
are  associated  with  that  block.  For  example,  contains  contributions  from  the  local 
matrices  for  subdomain  and  subdomain  Z?(2-,).  Merging  consists  of  redistributing 

the  four  local  stiffness  matrices  from  four  separate  jct_i  x  /ct_i  grids  to  one  Kt  x  /ct  grid, 
and  then  summing  the  contributions  to 
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Figure  4.2:  A  merged  domain  D (*)  and  the  approximate  nonzero  structure  of  its  local 
matrix 


As  usual  for  the  nested  dissection  method,  the  interior  unknowns  of  each  merged 
domain  D (*)  form  a  cross  (Figure  4.2,  left).  To  examine  the  decoupling  of  the  interior 
points  of  from  its  boundary,  we  temporarily  drop  the  index  t  from  the  discussion, 
letting  D  represent  a  domain  at  some  step  t  <  log2  k  —  2.  The  decoupling  is  based 
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on  parallel  block  Gaussian  elimination.  In  an  implementation,  it  would  be  necessary  to 
identify  the  twelve  sets  of  unknowns  precisely,  specifying  in  particular  how  points  on  the 
boundary  of  two  or  more  sets  (such  as  the  center  of  the  cross)  are  labeled.  We  avoid 
this  precise  identification,  instead  deriving  an  upper  bound  for  the  costs  by  considering  a 
matrix  5  with  a  simpler  structure:  we  take  5  to  be  a  block  12  x  12  matrix  with  square 
blocks  whose  nonzero  blocks  are  those  explicitly  identified  in  Figure  4.2.  These  blocks  are 
taken  to  be  dense  of  order  6.*  Algorithm  2  is  a  version  of  the  factorization,  block  forward 
solve  and  update  steps  (analogues  of  steps  (a)-(c)  of  Algorithm  1)  used  to  eliminate  the 
first  four  blocks  of  5.  At  each  step,  it  is  applied  simultaneously  to  all  the  local  matrices  5 
associated  with  domains  D  for  that  step.  The  algorithm  takes  advantage  of  sparsity  and 
symmetry  by  operating  only  on  nonzero  entries  of  the  block  upper  triangle  of  S.  At  the 
end  of  this  computation,  the  lower  right  8x8  block  is  to  be  merged  with  three  others  at 
the  next  step. 

Algorithm  2:  Eliminate  a  cross  for  nested  dissection, 
for  i  =  1  to  4 

factor  Su  into  LuL* 
for  j  =  *  +  1  to  pt 
if  (Sij  #  0)  5,,- 
end 

for  j  =  i  +  1  to  pj 
for  k  —  j  to  pt 

if  {Sij  ^  0  and  5,*  ^  0)  5,-fc  <—  Sjk  —  S? 5,* 
end 
end 
end 


L«Sij 


Cholesky  factorization 


Block  forward  solves 


Matrix-matrix  product 
(for  update) 


The  factorization  step  of  the  parallel  nested  dissection  is  now  described  by  the  follow¬ 
ing  algorithm,  each  step  of  which  is  performed  on  a  ict  x  k%  grid  of  processors: 

Algorithm  3:  Factorization  by  parallel  nested  dissection. 

For  t  —  1  to  log2  k 

Local  assembly:  Assemble  5 W;  merge  the  four  subdomains  from  step  t  —  1 
by  redistributing  and  adding  four  local  submatrie'es. 

Eliminate  the  interior  cross:  Apply  Algorithm  2  with  5  =  S^>. 
end 

The  costs  are  determined  from  the  individual  costs  of  the  merge  and  the  large  scale  compu¬ 
tations  of  the  factorization  (Cholesky  factorization,  block  forward  solve  and,  matrix-matrix 
product)  on  a  Kt  x  /ct  processor  grid.  In  our  complexity  analysis  (here  and  in  the  remainder 
of  the  section),  we  assume  that  mp  >  2  and  k  >  4.  The  first  assumption  means  that  the 
problem  is  not  very  small  relative  to  the  number  of  processors,  and  the  second  means  that 


*  The  true  local  matrix  for  D  resembles  5,  but  some  diagonal  blocks  have  order  6  — 
1  instead  of  6,  and  some  other  off-diagonal  blocks  have  nonzero  rows  or  columns.  For 
example,  if  the  center  of  the  cross  is  placed  in  set  number  1,  then  there  is  a  nonzero  row 
in  the  (1,6)  block.  It  is  straightforward  to  show  that  the  operation  counts  are  higher  for 
this  simplified  matrix  5. 
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the  processor  grid  contains  some  interior  processors.  The  following  result  gives  an  upper 
bound  for  the  costs  of  Algorithm  3;  a  proof  is  given  in  the  Appendix. 

Theorem  3:  The  global  matrix  G  can  be  factored  using  the  parallel  nested  dissection 
algorithm  with  cost 

~m3p3k  —  §m3p3  log2  k  —  -~m3p3  arithmetic, 

~m2p2k  -  ^m2p2  log2  k  -  ^m2p2  communication, 

322 k  -  326log2  k  —  275  startups. 

As  in  steps  (a)-(c)  of  Algorithm  1,  this  computation  is  independent  of  the  number  of 
load  vectors.  It  remains  to  specify  the  costs  of  the  global  forward  elimination  and  back 
substitution,  which  we  assume  are  performed  in  serial  order,  once  for  each  load  vector.  For 
a  single  load  vector  6  =  local  to  D M,  the  forward  elimination  and  back  substitution 
are  as  follows: 

Algorithm  4:  Forward  elimination  and  back  substitution. 


Forward  elimination, 
for  i  =  1  to  4 
6,  -  L^bi 
for  j  =  i  +  1  to  pt 

if  {Sij  Jt  0)  4,  bj  -  S*b , 
end 
end 


Back  substitution 
for  *  =  4  to  1 
bi  —  0 

for  j  =  *  +  1  to  pt 

if  {Sij  #  0 )bt*~  bi  -  5,,-6,- 
end 

6,  L~Tbi 

end 


The  global  eliminations  consist  of  log2  k  steps  of  these  two  computations.  We  discuss  only 
the  forward  elimination;  the  costs  for  the  back  substitution  are  identical. 

Algorithm  5:  Global  forward  elimination. 

For  t  =  1  to  iog2  k 

Local  assembly:  Assemble  6^:  merge  the  four  load  vectors  from  step  t  —  1. 
Eliminate  the  interior  cross:  Apply  Algorithm  4  with  S  =  and  b  =  b^K 
end 

The  following  result  gives  the  costs  of  the  forward-  and  back-solves;  see  the  Appendix  for 
a  proof.  Note  that  the  asymptotic  cost  of  arithmetic  is  0(n2p2/Jk),  which  is  suboptimal. 
(The  serial  cost  is  0(n2p2  log2  n),  as  in  the  second  term  of  (4.1)  with  m  =  n.)  See  e.g. 
[12]  for  discussions  of  efficient  parallel  triangular  solution  schemes. 

Theorem  4:  The  cost  of  forward  solves  and  back  substitution  for  the  parallel  nested 
dissection  algorithm  applied  to  s  load  vectors  is 


6 m2p2k  +  -3*~-6  m2p2  log2  k  -  (18s  +  6)m2p2 
ZZmpk  —  (38s  -  30)mplog2  k  -  (28s  +  26 )mp 
66 k  +  (76 s  -  60)  log2  k  -  (56s  +  52) 


arithmetic, 

communication, 

startups. 


Finally,  the  storage  requirements  for  local  elimination  and  parallel  nested  dissection 
are  outlined  in  the  following  result. 
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Theorem  5:  The  high  order  storage  requirements  per  processor  for  local  elimination 
combined  with  global  elimination  by  parallel  nested  dissection  are 

^|m2p4  -I-  j3m2p3)  +  2i-m2p2  log2  m  for  local  elimination, 

30m2p2  log2  k  for  global  elimination, 

where  the  parenthesized  term  applies  only  for  p  >  4. 

Proof:  The  local  cost  comes  directly  from  expression  (4.1)  of  Theorem  1.  For  the  global 
factorization,  it  is  necessary  to  store  each  filled-in  version  of  computed  by  Algorithm 
2.  For  all  t,  the  nonzero  blocks  of  5 ^  are  dense  of  order  6t,  and  they  are  distributed  on  a 
Kt  x  Kt  grid,  so  each  block  requires  (^-)2  =  d2/ 4  locations  per  processor,  where  d  =  mp. 
The  number  of  such  blocks  is  126  for  1  <  t  <  log2  k  -  2,  and  54  and  14  for  t  =  log2  it  -  1 
and  t  =  log2  k,  respectively.  Since  both  and  [S(*)]T  are  used  (see  the  note  in  the  proof 
of  Lemma  1),  little  advantage  is  taken  of  symmetry;  the  only  exception  is  on  the  diagonal, 
where  just  the  lower  triangular  factor  La  must  be  stored.  Hence, 

—  (l26(log2  k  -  2)  +  54  +  14)  -  ~  (l2(log2  k  -  2)  +  8  +  4)  =  30 d2  log2  k  -  44.5d2 

storage  locations  are  needed  in  each  processor.  Q.E.D. 

Here,  we  are  ignoring  pointer  overhead  and  some  temporary  storage  that  facilitates  pipelin¬ 
ing  in  Algorithm  2  (see  the  proof  of  Lemma  2  in  the  Appendix) . 


4.3.  Parallel  Conjugate  Gradient  for  Interfaces 

In  this  section,  we  outline  the  costs  of  a  parallel  implementation  of  the  conjugate 
gradient  method  for  solving  the  global  system  (3.4).  Given  an  initial  guess  vo,  CG  consists 
of  the  following  iteration,  whose  major  computations  are  listed  at  the  right. 

Algorithm  6:  The  preconditioned  conjugate  gradient  method  (PCG). 
ro  <-  g  -  Gv 0,  f0  <-  A/_1r0,  po  <-  f0,  r0  <-  rjf0 


For  »  =  0  until  convergence  do 

• 

Wi  *—  Gvi 

Matrix-vector  product 

T)i  <-  pfwi,  a,  «-  Ti/rji 

Inner  product 

*  Xi  +  a, Pi 

Scalar-vector  product 

r»+i  «-  ri  -  aiWi 

Scalar-vector  product 

fi+i  *-  M_lri+1 

Preconditioning 

r.+i  «-  rf+lfi+u  0i  «-  Ti+i/Ti 

Inner  product 

Pi+l  ^i  +  l  +  PiPi 

Scalar-vector  product 

Thus,  each  iteration  requires  one  (parallel)  matrix-vector  product,  one  preconditioning 
solve  Mr  =  r,  two  inner  products  and  three  scalar-vector  products.  We  consider  unpre¬ 
conditioned  CG  (i.e.  where  M  is  the  identity  operator),  as  well  as  using  preconditioning 
by  by  a  sparse  approximation  of  G. 

Our  strategy  for  distribution  of  data  (except  the  preconditioner)  is  depicted  in  Figure 
4.3.  Each  processor  P  is  associated  with  a  super-element  D.  For  any  vector  v  associated 
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Figure  4.3:  Data  distribution  and  local  matrix  structure  for  the  conjugate  gradient  method. 

with  the  boundary  of  D,  P  contains  in  its  local  memory  the  entries  of  v  on  the  north 
and  east  boundaries,  excluding  the  northwest  and  southeast  entries.  For  most  processors 
there  are  2d  —  1  such  values,  where  d  =  mp.  (Exceptions  are  the  top  row  and  right 
column  of  processors  in  the  grid,  which  contain  no  north  boundary  and  no  east  boundary, 
respectively.  They  perform  less  work  than  the  other  processors  and  are  idle  some  of  the 
time.)  The  vector  v  is  divided  into  four  disjoint  components:  t >j  for  the  2d  —  1  entries  stored 
locally  in  P,  and  vw,  va  and  vaut  for  the  elements  of  v  stored  in  the  processors  to  the  left, 
below  and  below  left  of  P.  Let  A  denote  the  local  matrix  for  D  after  local  elimination. 

A  is  a  dense  4 d  x  4 d  matrix,  but  we  adopt  the  convention  that  the  block  diagonals  of  A 
are  assembled  to  reflect  the  locations  of  unknowns.  That  is,  the  (a,s),  (u;,u>)  and  ( sw,sw ) 
blocks  of  A  are  set  to  zero,  so  that  there  are  actually  14d2  -  1  nonzeros  in  A. 

The  costs  of  the  matrix  product  and  vector  operations  are  as  follows  (here  d  =  mp): 
Matrix-vector  product  y  <—  Ax:  Id.2  arithmetic  and  4(d+l)  communication,  with  4  startups. 
This  is  determined  from  the  following  steps  of  arithmetic  and  communication,  whose  costs 
are  listed  on  the  right: 

(1)  In  parallel:  send  xw  east,  xt  north  and  x,w  north.  1  startup,  send  d  words 

(2)  Send  xaw  east.  1  startup,  send  1  word 

(3)  Compute  y  *—  Ax  14 d2  —  1  arithmetic 

(4)  Send  yaw  west.  •  1  startup,  send  1  word 

(5)  In  parallel:  send  yw  west,  yt  south  and  ytw  south  1  startup,  send  d  words 

Inner  product  t  <—  All  domains  xT^:  2d  -  1  arithmetic,  and  2k  communication  and  2Jt 

startups  to  accumulate  the  sum  and  distribute  it  to  all  processors. 

Scalar-vector  product  y  ♦—  ax  for  scalar  a:  2d  —  1  arithmetic  and  no  communication. 

For  the  preconditioner  M,  we  consider  a  symmetric  permutation  of  the  matrix 


(j  :)■ 


where  G  is  a  global  interface  matrix  corresponding  to  a  subset  of  the  unknowns  on  the 
boundaries  of  the  local  interfaces.  For  example,  one  could  choose  the  corner  unknowns 
of  all  super-elements,  plus  one  unknown  from  each  side  of  every  super-element,  so  that 
G  has  the  form  of  a  low  order  operator  approximating  G.  The  permutation  maps  the 
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specified  unknowns  to  the  lowest  indices  in  the  natural  way.  Suppose  d  <  d  unknowns 
per  side  are  used  to  define  the  preconditioner.  Then  the  costs  of  applying  and  storing 
the  preconditioning  operator  are  obtained  by  replacing  mp  with  d  in  Theorems  4  and  5. 
Hence,  the  costs  of  the  conjugate  gradient  method  are  as  follows. 

Theorem  6:  The  cost  per  step  for  CG  without  preconditioning  is 

14m2p2  +  lOmp  -  6  arithmetic, 

2mp  +  4k  -f  6  communication, 

4k  -+-  4  startups. 

The  additional  cost  per  step  for  preconditioning  by  (4.2)  is  (for  d  >  2) 

6d2k  +  y d2  log2  k  -  24  arithmetic, 

33  dk  —  8  d  log2  k  —  54  communication, 

66  k  +  16  log2  k  startups. 

The  storage  requirements  (not  including  those  for  local  elimination)  are  14m2p2  +  20mp 
without  preconditioning  (for  A,  x,  r,  p  and  to),  and  an  additional  30m2  log2  k  +  Amp  with 
preconditioning  (for  G  in  factored  form  and  f).  Q.E.D. 

There  is  also  a  preprocessing  cost  for  factoring  G ,  obtained  by  replacing  mp  with  d  in 
Theorem  3.  Note  that  the  efficiency  of  the  unpreconditioned  algorithm  approaches  one  as 
the  problem  size  grows. 

An  implementation  of  unpreconditioned  CG  (with  benchmarks  on  a  hypercube)  that 
computes  an  extra  inner  product  but  decreases  the  startup  overhead  of  inner  products  is 
presented  in  [ll]. 


5.  Numerical  Experiments 

In  this  section,  we  describe  the  results  of  numerical  experiments  for  solving  a  model 
problem,  and  we  combine  these  results  with  the  cost  analysis  of  the  previous  section  to 
estimate  parallel  costs.  Consider  the  model  problem 


— Au  =  0  on  H  =  [0,1]  x  [0, 1], 


(5.1a) 


^(X1,°)  =  ^(°,X2)=0,  £(x1,l)=p1(x1),  £(l,X2)=n(*7),  (5-l&) 

where  gx  are  determined  so  that  the  solution  is 

2 

u(xi,x2)  =  ReMa?  +  z2)-1  +  (a2  -  z2)-1) - r,  a  >  1,  z  =  xi+ix2. 

a 1 

Note  that  u  is  a  harmonic  function  with  a  singularity  at  z  —  ±a,  z  -  ±ia  so  that  the 
solution  becomes  less  smooth  as  a  —*  1.  The  parameter  a  characterizes  the  regularity  of 
the  solution.  This  model  problem  is  a  characteristic  one  for  many  problems  in  structural 
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Figure  5.1:  Accuracy  of  the  finite  element  solutions. 

mechanics.  We  consider  two  versions  of  this  model:  Problem  1,  with  a  =  1.1,  and  Problem 
2,  with  a  =  1.05. 

Figure  5.1  shows  the  accuracy  ||u(S)  —  u1|b/||u||£;  obtained  using  elements  of  degree 
p  and  n  x  n  element  grids  with  n  =  2,  4,  and  8,  to  solve  (5.1)  for  the  two  choices  a  =  1.1 
and  a  =  1.05.  The  finite  element  computation  was  made  by  the  code  PROBE  [16]  on 
an  Apollo  3000D  in  double  precision.  For  use  with  the  parallel  analysis  of  the  previous 
section,  we  extrapolated  these  results  in  two  ways  to  get  data  for  finer  grids.  First,  we  used 
the  asymptotic  values  of  the  slopes  of  these  curves  to  extrapolate  to  values  for  n  =  16  and 
n  =  32.  In  the  figure,  the  extrapolated  results  are  indicated  by  dotted  lines.  Second,  we 
treated  this  problem  as  though  it  is  a  subproblem  of  a  larger  one  discretized  on  a  4n  x  4n 
element  grid,  in  which  fi  is  ^  of  a  larger  domain.  That  is,  our  operation  counts  are  for 
a  problem  posed  on  a  domain  with  four  times  as  many  elements  in  each  direction,  but  in 
which  the  accuracy  is  as  in  Figure  5.1.  In  the  following,  we  consider  n  x  n  element  grids 
with  n  =  8,  16,  32,  64  and  128.  Figure  5.2  graphs  accuracy  as  a  function  of  cost  for  serial 
computations,  for  direct  solves  using  several  choices  of  finite  element  method  and  the  two 
values  of  a. 

For  examining  parallel  costs,  unless  otherwise  specified,  we  fix  the  costs  of  arithmetic 
and  communication  as  follows.  A  floating  point  multiplication  is  normalized  to  take  one 
unit  of  time.  Communication  is  10  times  faster  than  arithmetic  but  incurs  a  startup  cost 
equal  to  the  time  required  to  perform  10  floating  point  multiplications  (or  send  100  words). 
Thus,  n  words  of  data  cam  be  sent  to  a  neighbor  in  10  +  .In  units  of  time.  (These  choices 
are  rough  approximations  to  observed  times  on  both  the  Intel  iPSC  and  Ncube  hypercube 
parallel  processors  [12].) 

Figures  5.3  -  5.6  show  the  costs  of  parallel  direct  solves  for  several  values  of  available 
parameters.  Figure  5.3  shows  accuracy  as  a  function  of  cost  for  direct  solution  on  64 
processors  arranged  in  an  8  x  8  grid  ( k  =  8),  for  a  =  1.1  and  a  =  1.05.  Comparison 
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Figure  5.4:  Parallel  costs  for  fixed  accuracy  and  varying  k  (Problem  1). 

cretization,  for  k  =  8  and  k  =  32.  The  upper  bound  on  speedup,  k2,  is  indicated  by  dashed 
lines.  The  maximal  speedup  is  slightly  greater  than  half  these  upper  bounds  (giving  ef¬ 
ficiency  of  about  50%).  There  are  two  reasons  that  the  asymptotic  speedup  is  less  than 
this  bound.  First,  symmetry  is  not  fully  exploited  in  the  parallel  computations:  during  the 
Cholesky  factorization  of  Algorithm  2,  computations  are  duplicated  in  the  upper  and  lower 
triangle  of  the  processor  grid,  and  during  the  block  forward  solves,  only  about  half  the 
processors  are  actually  computing.  (See  the  Appendix  for  details.)  Second,  the  parallel 
dissection  cannot  take  advantage  of  lower  costs  near  the  boundary  as  well  as  the  serial 
version;  this  is  reflected  in  the  difference  between  the  coefficient  of  m3p3  in  Theorem  3 
with  that  of  n3p3  in  Theorem  2.  We  also  observe  that  by  our  convention,  more  computa¬ 
tions  are  fully  local  for  higher  values  of  p,  so  that  maximal  speedup  is  achieved  for  smaller 
element  grids  for  these  values. 

In  Figure  5.6,  we  vary  the  relative  costs  of  communication  and  startups  for  the  same 
problem  considered  in  Figure  5.3,  left  (a  =  1.1).  For  Figure  5.6,  left,  communication  speed 
is  decreased  to  the  same  speed  as  arithmetic,  with  no  change  in  startup  cost.  For  Figure 
5.6,  right,  communication  speed  is  made  10  times  faster  than  arithmetic  (as  above),  but 
startup  cost  is  decreased  to  the  cost  of  one  multiplication.  Comparison  with  Figure  5.3, 
left,  indicates  that  the  (relatively  high)  cost  of  startups  significantly  degrades  performance, 
even  though  startups  are  a  lower  order  overall  cost,  whereas  the  cost  of  communication  is 
less  of  a  factor.  This  agrees  with  observations  made  in  [10] ,  although  it  is  also  known  that 
for  very  large  problems,  communication  costs,  which  are  of  lower  order  than  arithmetic 
costs,  will  be  negligable  [11]. 

For  the  conjugate  gradient  method,  the  standard  bound  on  the  error  at  the  j’th  step 
has  the  form  (lj 

||e<»||£<2(l-(.)'||el°>||s.  (5.2) 

Here,  ||e^||£  =  ||u^  —  u(S)||g,  the  energy  norm  of  the  discrete  error  at  the  j’th  CG 
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Figure  5.0:  Costs  for  two  other  models  of  parallel  costs,  k  =  8. 

* 

iteration,  and  the  decay  factor  1  —  p,  depends  on  the  iteration  matrix.  (For  uniform 
eigenvalue  distributions,  p  as  (condition  number) -l/2.)  We  study  the  performance  of  CG 
and  PCG  by  examining  its  performance  for  p  €  (0, 1),  for  solving  the  two  model  problems 
with  an  accuracy  comparable  to  that  achieved  with  a  direct  solver. 

In  particular,  let  the  desired  accuracy  be  1%.  From  the  data  used  to  produce  Figure 
5.1,  we  find  that  this  accuracy  is  achieved  for  Problem  1  when  (for  example)  p  =  2, 
n2  =  1024,  or  p  =  6,  n2  =  64,  and  for  Problem  2  when  p  =  4,  n2  =  1024  or  p  =  8, 
n2  =  100.  As  in  our  analysis  of  direct  methods,  we  simulate  a  finer  grid  by  replacing  n 


with  4n.  Therefore,  we  examine  the  conjugate  gradient  method  for  the  four  choices 
Problem  1:  p  =  2,  n  =  128  and  p  =  6,  n  =  32 
Problem  2:  p  =  4,  n  =  128  and  p  =  8,  n  =  40. 

From  (5.2),  approximately  ;  =  (log  e)/  log(l  —  p)  iterations  axe  needed  to  bound  the  relative 
error  ||e^||r/||e^||r  by  t.  We  use  this  estimate  on  iteration  counts,  with  the  choice 
e  =  10-3,  to  determine  the  costs  of  achieving  approximately  1%  accuracy.  Multiplying 
these  iteration  counts  by  the  cost  per  step  (taken  from  Theorem  6)  gives  the  overall  cost  of 
the  conjugate  gradient  method.  For  a  preconditioner,  we  consider  the  use  of  a  submatrix  of 
the  global  interface  matrix  G  corresponding  to  the  nodal  unknowns  on  the  super-elements, 
plus  one  side  unknown  from  every  side  of  the  super-elements  (so  that  d  =  2  in  Theorem 
6). 

Figure  5.7  compares  the  cost  of  CG  and  PCG  with  those  of  the  direct  solvers,  for 
solving  the  two  problems  with  one  load  vector  on  an  8  x  8  processor  grid.  The  results 
show  that  if  the  decay  factor  1  —  /i  is  much  less  than  one,  then  CG  and  PCG  will  be  more 
efficient  than  direct  solvers,  but  that  the  two  classes  of  methods  become  comparable  in 
cost  a s  p  — ►  0.  They  also  indicate  that  the  overhead  for  low-order  type  preconditioners  is 
not  a  significant  extra  expense.  (For  the  problems  considered,  the  number  of  points  mp 
on  each  interface  boundary  is  at  least  40,  much  larger  than  d  =  2.)  Figure  5.8  compares 
the  cost  of  CG  and  PCG  with  those  of  the  direct  solvers,  for  solving  the  two  problems 
with  thirty  load  vectors  on  an  8  x  8  grid.  Here,  the  factorization  for  the  preconditioner  is 
counted  only  once.  These  results  suggest  that  in  the  case  of  multiple  load  vectors,  parallel 
direct  solvers  will  be  highly  competitive  for  two-dimensional  problems,  even  if  the  rate  of 
convergence  of  CG  or  PCG  is  independent  of  mesh  size.  (We  remark  that  synchronization 
costs  for  the  CG  inner  products  are  of  low  order  for  the  problem  sizes  considered  here.) 
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Figure  5.7:  Comparison  of  conjugate  gradient  and  direct  solves  for  one  load  vector,  k  =  8. 

It  is  known  that  for  p  =  1,  p  behaves  like  1  /■>/nk  [4],  so  that  in  this  case  the  unpre¬ 
conditioned  algorithm  will  have  values  of  1  —  p  near  one.  We  expect  1  —  p  to  be  smaller 
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Figure  5.8:  Comparison  of  conjugate  gradient  and  direct  solves  for  thirty  load  vectors,  k  =  8. 

(even  without  preconditioning)  for  larger  p.  To  get  some  insight  into  this  point,  we  esti¬ 
mated  the  values  of  1  —  p  for  CG  for  applied  to  the  discretization  of  (5.1),  both  without 
preconditioning  and  with  a  particular  low  order  preconditioner,  for  several  values  of  n  and 
p.  (These  estimates  are  based  on  the  experiments  (3J.)  For  each  p,  the  elements  and 
super-elements  are  the  same,  and  local  elimination  of  the  internal  unknowns  is  performed 
in  each  element.  G  is  the  global  interface  matrix  corresponding  to  the  remaining  side  and 
nodal  unknowns,  and  the  preconditioner  G  is  the  submatrix  of  G  corresponding  to  p  =  1. 
The  results  are  shown  in  Table  1.  The  entries  are  the  average  values  of  ||e^)||£:/||e^_1)||£', 
taken  over  the  first  ten  iterations  of  CG,  using  a  zero  right  hand  side  and  smooth  initial 
guess.  The  left-hand  table  is  for  unpreconditioned  CG,  and  the  right-hand  table  is  for  pre¬ 
conditioning  of  the  p- version  interface  operator  by  the  h-version  interface  operator.  Note 
that  the  preconditioning  is  different  from  that  considered  above,  where  local  elimination 
is  also  applied  to  unknowns  coming  from  side  and  nodal  unknowns  inside  super-elements. 


isi  cad 


c* 

II 

a. 

p=4 

p=6 

p=8 

cs 

II 

c 

.383 

.521 

.572 

.616 

9 

II 

.595 

.644 

.672 

.699 

n=6 

.673 

.689 

.721 

.743 

9 

II 

00 

.744 

.762 

.765 

.770 

o 

9—4 

II 

c 

.794 

.808 

.805 

.804 

p=8 

p=  10 

.574 

.614 

.571 

.623 

.563 

.609 

.554 

.600 

.547 

.591 

Table  1:  Average  decay  factors  for  CG  applied  to  the  p- version  interface  operator,  without 
preconditioning  (left)  and  with  preconditioning  by  the  /i-version  interface  operator  (right). 


6.  Conclusions 


We  have  considered  some  issues  associated  with  parallel  solution  of  linear  problems 
arising  from  finite  element  analysis.  Although  the  quadratures  required  to  set  up  these 
systems  often  occupy  a  substantial  portion  of  the  computational  time,  these  computations 
are  fully  parallel;  consequently,  our  focus  has  been  on  the  costs  of  linear  system  solution. 
The  general  strategy  considered  is  to  divide  the  problem  among  available  processors,  and 
perform  local  elimination  wherever  possible.  The  unknowns  corresponding  to  “processor 
interfaces”  are  then  computed  using  either  direct  solvers  or  iterative  solvers.  Some  of  the 
conclusions  we’ve  reached  are: 

1.  Sparse  direct  solvers  based  on  nested  dissection  are  applicable  to  systems  arising 
from  both  the  h-  and  /ip-version  of  the  finite  element  method,  and  the  factoriza¬ 
tion  achieves  maximum  efficiency  of  approximately  50%. 

2.  The  use  of  high  order  elements  appears  to  be  a  natural  way  to  increase  the  amount 
of  local  computation  and  achieve  accuracy. 

3.  Sparse  direct  methods  are  highly  competitive  with  preconditioned  conjugate  gra¬ 
dient  methods,  especially  in  the  case  of  multiple  load  vectors.  This  is  despite  the 
fact  that  triangular  system  solution  does  not  achieve  optimal  order  speedup. 

We  have  not  addressed  two  critical  issues  of  finite  element  analysis  here,  namely  three- 
dimensional  problems  and  adaptivity.  Hierarchical  direct  solvers  of  the  type  considered 
here  can  be  used  for  three  dimensional  problems,  although  the  global  data  movements  will 
be  more  complex.  We  speculate  that  high  order  methods  will  be  of  use  in  this  regime, 
again  by  making  more  of  the  computations  local  in  a  natural  way,  at  the  possible  cost 
of  increased  local  storage  requirements.  A  very  useful  methodology  would  be  effective 
low  order  preconditioners  for  the  conjugate  gradient  method.  Adaptive  methods  are  less 
amenable  to  the  type  of  analysis  considered  here,  but  we  believe  that  methods  based  on 
hierarchical  use  of  high  order  elements  would  require  richer  interconnection  schemes  than 
the  processor  mesh  considered  here. 


Appendix 

In  this  section  we  present  proofs  for  the  cost  analyses  of  Sections  4.1  and  4.2.  The 
costs  of  local  elimination  are  derived  from  standard  analyses  of  serial  direct  methods. 
Proof  of  Theorem  1:  The  operation  counts  are  written  in  the  form  m2x  (cost  of  p- 
version)  +  (cost  on  local  interfaces).  The  costs  for  the  p- version  are  derived  from  the 
standard  analysis  of  dense  elimination  [8]:  it  is  known  that  the  number  of  multiplications 
required  to  factor  a  dense  matrix  of  order  a  is  ^a3  -I-  |a2  —  |a.  The  result  follows  from 
the  facts  that  in  most  elements,  A  has  order  a  =  A(p  -  2)(p  -  3)  and  B  is  a  full  matrix 
with  4p  columns.  The  result  for  local  interfaces  follows  directly  from  George’s  original 
analysis  of  nested  dissection  (7].  We  only  consider  the  case  of  k  >  4,  so  that  the  costs 
are  determined  by  the  (internal)  super-elements,  which  have  four  boundaries.  Hence,  in 
George’s  terminology,  the  operation  counts  are  those  for  elimination  of  “interior  subsets” 
in  the  dissection,  as  in  (7),  p.  360,  Lemma  A. 2.  At  the  j’th  step,  for  1  <  j  <  log2  m,  these 
counts  are  x  23;p3  -  17  x  22;p2  —  —  x  2 Jp  +  3),  and  the  number  of  nonzeros  in 


the  j’th  section  of  the  factors  is  (§r)2(^  x  2 2;p2  -  13  x  2Jp  +  3).  The  total  is  obtained  by 
summing  these  expressions  for  j  =  1  to  log2  m.  Q.E.D. 


Now  consider  the  parallel  global  matrix  factorization  (Algorithm  3).  The  analysis 
presented  is  a  generalization  of  the  analysis  of  [5j.  Data  is  arranged  as  follows.  Prior  to 
the  elimination  of  the  interior  cross  at  the  t’th  step,  5 W  is  a  block  pt  x  pt  matrix,  where 
for  1  <  t  <  log2  k  —  2,  pt  =  12.  SM  has  the  the  nonzero  pattern  of  Figure  4.2,  and  only  the 
block  upper  triangle  is  needed.  Each  block  of  has  order  St  =  2 t~1d  (where  d  =  mp), 
and  its  6 2  entries  are  distributed  on  a  k*  x  Kt  grid  of  processors,  where  Kt  =  2*.  When 
it  is  convenient,  we  use  the  symbols  S,  k  and  6  and  omit  the  counter  t.  The  following 
result  summarizes  the  costs  of  the  large  scale  computations  (Cholesky  factorization,  block 
forward  solve  and  matrix-matrix  product)  that  are  performed  one  or  more  timet  during 
each  of  four  steps. 

Lemma  1:  If  the  matrix  5  of  Algorithm  2  contains  blocks  of  order  6,  and  each  of  the 
blocks  is  equidistributed  on  a  k  x  k  grid  of  processors,  then  the  individual  large  scale 
computations  used  in  a  parallel  implementation  of  Algorithm  2  can  be  implemented  with 
the  following  costs: 

r  factorizations:  (r  -)-  2)«  -  2  arithmetic  steps 

(r  +  2)k  -  3  communication  steps 

r  block  forward  solves:  (r  +  2)k  ~  2  arithmetic  steps 

(3r  +  2)k  -  (2 r  -I-  3)  communication  steps 
r  matrix-matrix  products:  r/c  arithmetic  steps 

2 r*  -  2r  communication  steps, 

where  an  arithmetic  step  consists  of  (£)3  multiplications,  and  a  communication  step  con¬ 
sists  of  sending  (£)2  items  of  data  to  a  neighboring  processor. 

Proof:  We  consider  the  factorizations,  block  forward  solves  and  matrix  products  sepa¬ 
rately.  In  the  proof,  individual  matrices  distributed  on  the  processor  grid  are  indexed 
according  to  their  locations  in  the  grid,  i.e.  if  A  is  any  such  matrix,  then  A refers  to  the 
portion  of  A  stored  in  the  processor  indexed  by  (n,u),  1  <  p,  u  <  k.  The  symbols  5  and  L 
are  reserved  to  refer  to  the  matrices  of  Algorithm  2,  i.e.  is  a  block  of  5,  equidistributed 
among  the  processors. 

Cholesky  factorization.  Let  A  =  MMr  denote  the  Cholesky  factorization  of  a  block  5,,, 
1  <  i  <  4.  The  factorization  moves  in  a  series  of  waves  across  the  grid  where  each  wave 
computes  one  column  of  M  in  the  lower  triangle  of  the  processor  grid  and,  simultaneously, 
an  (identical)  row  of  Mr  in  the  upper  triangle  (see  [14]).  The  first  wave  is  shown  in  Figure 
A.l.  Simultaneous  steps  of  arithmetic  (ai)  and  communication  (e,)  are  identified  by  diag¬ 
onal  lines.  The  steps  are  synchronized  as  in  the  figure,  so  that  the  costs  of  arithmetic  are 
determined  by  the  matrix-matrix  products,  which  require  (£)3  multiplications;  similarly, 

the  communication  cost  is  (-)  words  with  one  startup.  The  last  step  of  this  wave,  which 
takes  place  in  the  (k,k)  (bottom  right)  processor,  is  completed  after  2k  -  1(  =  7)  steps  of 
arithmetic  and  2k  —  2(  =  6)  steps  of  communication.  The  second  wave  is  performed  on  the 
lower  right  (k  —  1)  x  (k  —  1)  grid  of  processors,  beginning  in  processor  (2,2)  at  step  a*. 
The  factorization  is  completed  after  3k  -  2  arithmetic  and  3k  —  3  communication  steps. 


When  r  factorizations  are  pipelined  (we  are  concerned  only  with  r  =  1,2),  it  is  necessary 
that  the  waves  of  each  factorization  do  not  collide  with  any  from  am  earlier  one.  This  is 
achieved  by  having  each  factorization  begin  k  steps  after  the  previous  one,  in  the  (1,1) 
processor.  The  cost  of  r  factorizations  is  then  (r  +  2)/c  —  2  arithmetic  steps  and  (r  +  2)*-2 
communication  steps. 


Figure  A.l:  The  first  wave  of  a  Cholesky  factorizaton  on  a  4  x  4  processor  grid. 

Block  forward  solve.  There  are  r  computations  of  the  form  B  —  A/-1  A,  where  M  is  the 
lower  triangular  factor  distributed  in  the  lower  triangle  of  the  processor  grid,  and  A  and  B 
are  square  matrices  equidistributed  among  the  processor  grid.  (In  Algorithm  2,  M  =  La 
and  A  =  Sij  for  some  appropriate  i,j.)  The  block  solve  requires  one  sweep  across  the  grid 
for  each  column  of  A.  Figure  A.2  shows  the  computation  of  {B^i  |  1  <  n  <  /c},  with  the 
result  stored  in  the  diagonal  processors.  2/e  —  1  arithmetic  steps  and  2/c  —  2  communication 
steps  are  needed.  If  all  the  block  columns  of  A  have  been  positioned  in  the  left  processor 
column,  then  this  step  can  be  pipelined  the  computation  for  the  »’th  column  begins  at 
(arithmetic)  step  i  and  ends  at  step  i  +  2/c  —  2.  Similarly,  the  computation  for  r  block 
matrices  A  requires  (r  +  2)/c  -  2  arithmetic  steps  and  (r  +  2)/c  —  3  communication  steps. 
As  above,  the  cost  of  each  step  is  (£)3  multiplications  and  (£)2  communication  with 
one  startup.  There  is  a  preprocessing  cost  of  r(/c  —  1)  communication  steps  to  position  the 
columns  of  A,  and  a  postprocessing  cost  of  r(/c  - 1)  to  correctly  place  the  computed  columns 
of  B*  The  total  is  (r  +  2)/c  -  2  arithmetic  steps  and  (3r  4-  2)/e  —  (2r  +  3)  communication 
steps. 

*  In  the  postprocessing  step,  £,;  is  moved  (horizontally)  from  the  diagonal  processor 
(»,»)  to  processor  (*,  j).  Since  BT  is  required  for  the  matrix-matrix  product  of  Algorithm 
2,  Bij  is  simultaneously  moved  (vertically)  to  processor  (j,  i). 
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Figure  A. 2:  The  block  forward  solve  for  the  first  column  on  a  4  x  4  processor  grid. 

Matrix-matrix  product.  These  computations  have  the  form  BTA,  which  requires  a  bidi¬ 
rectional  horizontal  communication  of  all  columns  of  BT  (see  the  footnote  in  the  previous 
paragraph)  and,  simultaneously,  bidirectional  vertical  communication  of  ail  rows  of  A. 
This  is  followed  by  the  block  matrix  sum  CM„  =  *n  processor  (/x,i/).  The 

cost  is  r/c  arithmetic  steps  and  2r(k  -  1)  communication  steps.  Q.E.D 
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2  factorizations 

12  solves 

42  matrix  products 

2  factorizations 

8  solves 

14  matrix  products 

2  factorizations 

4  solves 

6  matrix  products 

i  =  3 

1  factorization 

9  solves 

45  matrix  products 

1  factorization 

5  solves  . 

15  matrix  products 

1  factorization 

1  solve 

1  matrix  product 

i  =  4 

1  factorization 

8  solves 

36  matrix  products 

1  factorization 

4  solves 

10  matrix  products 

1  factorization 

Table  2:  Number  of  instances  of  large  scale  computations  in  Algorithm  2. 


Table  2  shows  the  number  r  of  times  each  of  the  large  scale  tasks  is  done.  The  rows  of 
the  table  correspond  to  places  where  the  factorizations  and  forward  solves  can  be  pipelined. 
Substitution  of  the  values  of  r  from  Table  2  into  the  expressions  of  Lemma  1  then  gives 
the  number  of  steps  required  to  perform  Algorithm  2: 

Lemma  2:  Under  the  hypotheses  of  Lemma  1,  the  number  of  steps  needed  in  a  parallel 
implementation  of  Algorithm  2  is  as  follows: 


1  <  t  <  log2  k  —  2 
t  =  log  2  k  -  1 
t  =  log  j  k. 


Arithmetic  steps:  Communication  steps: 

168*t  -  12  349/c*  -  322 

72 Kt  -  12  145/ct  -  130 

26**  -  10  43/ct  -  39 

Let  TW  denote  the  lower  right  square  submatrix  of  S W  below  row  4;  for  t  <  log2  fc, 
T<*>  is  an  8  x  8  block  matrix.  After  the  elimination,  the  block  upper  triangle  of  the  lower 
right  of  rW  has  filled  in.  Prior  to  this  computation,  four  such  8x8  block  matrices  on  four 
separate  /c«_i  x  /c*— i  processor  subgrids  are  redistributed  to  the  t’th  grid  and  merged  into 
5  =  S(*K  The  cost  of  this  operation  is  summarized  by  the  following  result. 

Lemma  3:  For  the  matrix  S  of  Lemma  1,  the  number  of  communication  steps  needed  to 
merge  the  local  matrices  at  step  t  of  Algorithm  3  is 

20/c*  t  =  1 

24 Kt  -  4  2  <  t  <  log2  k  -  1 

8xt  -  4  t  =  log2  k. 

Proof:  For  t  <  log2  k  —  1,  let  T  =  T ^  denote  the  8x8  matrix  from  step  t  —  1  as  above, 
and  let  the  four  subdomain  quadrants  be  denoted  {Z?tJ  1 1  <  t,j  <  2}  (see  Figure  4.2). 
T  is  associated  with  one  of  the  quadrants,  wlog,  Z?i,i.  Each  block  of  T  has  order  ^  and 
is  distributed  on  the  x  grid.  Now  let  T  be  relabeled  as  a  block  4x4  matrix  in 
which  each  new  block  is  square  (of  order  6t)  and  contains  four  old  blocks.  Let  U  denote 
one  of  these  new  blocks  of  T;  by  definition,  U  can  also  be  thought  of  as  divided  into  four 
quadrants.  The  data  movement  needed  for  merging  is  to  move  each  quadrant  of  U  to  the 
corresponding  quadrant  of  processors  for  the  subdomains.  The  relocation  of  U  can  proceed 
in  the  following  four  steps: 

1.  Move  Ult2  from  quadrant  (1,1)  to  quadrant  (1,2)  (clockwise). 

2.  Move  U2, i  from  quadrant  (1,1)  to  quadrant  (2,1)  (counterclockwise). 

3.  Move  U2,2  from  quadrant  (1,1)  to  quadrant  (1,2)  (clockwise). 

4.  Move  U2,2  from  quadrant  (1,2)  to  quadrant  (2,2)  (clockwise). 

There  are  10  nonzero  blocks  of  the  form  of  U  in  T,  so  that  40  movements  of  this  type  are 
needed  to  relocate  all  of  T.  At  each  step,  data  must  be  moved  across  ^  processors,  so 
that  the  number  of  communication  steps  is  20**.  Data  from  the  other  three  quadrants  is 
simultaneously  relocated  in  an  analogous  manner.  This  is  all  the  communication  required 
for  t  <  logfc  —1,  except  that  prior  to  these  steps,  the  lower  left  quadrant  of  the  (new) 
diagonal  blocks  of  T  (in  the  (2, 1)  position)  are  not  explicitly  represented  when  t  >  1; 
these  quadrants  can  be  made  available  at  cost  4 («*  —  1).  The  same  analysis  applies  for 

t  =  log2  k,  except  in  this  case  T  is  a  4  x  4  block  matrix,  or  2  x  2  after  relabeling,  and  only 

two  transpose  blocks  must  be  computed.  Q.E.D. 

Proof  of  Theorem  3:  The  result  is  obtained  by  summing  the  expressions  of  Lemmas  2 
and  3  for  t  =  1  to  log 2k,  and  using  the  values  St  =  2*<f/2,  ««  =  2*,  and  the  facts  that 
each  arithmetic  step  requires  (^-)3  =  d3/ 8  multiplications  and  each  communication  step 
requires  one  startup  plus  sending  of  (|^)2  =  d2/ 4  words.  Q.E.D. 


=  1  and  2 


1  <  t  <  logj  k  -  2 
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t  =  log  2  k 
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12s  matrix  products 

2 3  solves 

8s  matrix  products 

2 a  solves 

4s  matrix  products 

s  solves 

93  matrix  products 

3  solves 

53  matrix  products 

3  solves 

3  matrix  product 

3  solves 

8s  matrix  products 

3  solves 

43  matrix  products 

3  solves 

Table  3:  Number  of  instances  of  large  scale  computations  in  Algorithm  4. 


The  analysis  of  the  global  forward  and  back  solution  steps  is  similar.  We  consider  only 
the  forward  solve,  for  which  the  large  scale  computations  are  lower  triangular  matrix  solves 
and  matrix-vector  products.  Table  3  shows  the  number  of  such  computations  required  for 
a  load  vectors.  Every  right  hand  side  b  =  6^  at  step  t  is  (for  t  <  log2  k  —  2)  a  vector  with 
12  blocks  of  size  St;  assume  that  each  of  these  blocks  is  distributed  among  the  diagonal 
processors  of  the  Kt  x  Kt  grid.  If  there  are  s  such  right  hand  sides,  then  the  cost  of  the 
forward  elimination  is  determined  from  the  following  result. 

Lemma  4:  On  a  k  x  k  processor  grid,  the  individual  large  scale  computations  of  Algorithm 
4  can  be  implemented  with  the  following  costs: 


ra  forward  solves:  2/c  4-  ra  -  2  arithmetic  steps 

4/c  4-  3rs  -  7  communication  steps 
ra  matrix-vector  products:  ra  arithmetic  steps 

2/c  4-  a  -  2  communication  steps  (*  =  1, 2, 3) 

2k  4-  ra  —  2  communication  steps  (*  =  4). 

where  an  arithmetic  step  consists  of  (|)  multiplications  and  a  communication  step  con¬ 
sists  of  sending  £  items  of  data  to  a  neighboring  processor. 

Proof:  The  operations  and  costs  required  for  ra  forward  solves  are  as  follows. 

(1)  Relocate  ra  vectors  from  the  diagonal  processors  to  the  leftmost  processors  in  the  grid. 
With  pipelining,  the  cost  is  k  4-  ra  —  2  communication  steps. 

(2)  Compute  ra  solves  of  the  form  b  «—  M~lb.  Here,  M  is  some  triangular  factor  Ln 
produced  by  Algorithm  3  (see  the  proof  of  Lemma  1).  The  cost  is  2/c  4-  ra  —  2  arithmetic 
steps  and  2/c  4-  4s  —  3  communication  steps.  Each  resulting  b  is  located  in  the  diagonal 
processors. 

(3)  Distribute  each  6  across  processor  columns,  i.e.  move  a  copy  of  the  part  of  b  in  the 

(n,n)  processor  to  each  processor  1  <  u  <  k.  The  cost  is  k  4-  ra  —  2  communication 

steps. 

The  ra  matrix-vector  products  then  require  ra  arithmetic  steps,  using  the  matrices  Sjj 
constructed  by  Algorithm  3.  The  results  must  ultimately  be  summed  across  the  processor 
rows  and  stored  in  the  diagonal  processors.  At  step  i  of  Algorithm  4,  1  <  »  <  3,  it  is  only 
necessary  to  do  this  for  each  (of  a  versions  of)  6,+i,  which  requires  k  4-  a  —  2  communication 
steps.  At  step  *  =  4,  r a  accumulations  are  required,  at  cost  k  4-  ra  —  2  communication 
steps.  Q.E.D. 


ra  matrix-vector  products: 


arithmetic  steps 
communication  steps 
arithmetic  steps 

communication  steps  (t  =  1,2,3) 
communication  steps  (»  =  4). 


I 


Lemma  5:  The  number  of  steps  needed  in  a  parallel  implementation  of  Algorithm  5  is  as 
follows: 

Arithmetic  steps:  Communication  steps: 

6«t  +  33s  -  6  15*t  +  22s  -  27  1  <  t  <  log2  k  -  2 

6  Kt  +  21s  —  6  15#ct  +  18*  —  27  t  =  log2  k  —  1 

6 Kt  +  9s  —  6  14k*  +  14s  —  23  t  =  log2  k. 

Proof:  Substitute  the  values  of  r  from  Table  3  into  the  expressions  of  Lemma  4.  Q.E.D. 
Lemma  0:  The  number  of  communication  steps  needed  to  merge  s  local  load  vectors  at 
step  t  of  Algorithm  5  is 

3«t/2  +  16s  -  3  1  <  t  <  log 2  k  —  1 

3<t/2  +  8s-3  t  =  log2  k. 

Proof:  We  only  discuss  the  case  1  <  t  <  log2  k  —  1.  Before  the  elimination  step  there  are 
four  quadrants  of  x  processor  grids.  For  every  load  vector  6,  there  are  eight  blocks 
(fy}/=5  distributed  among  the  diagonal  processors  in  each  of  these  quadrants.  The  stategy 
for  merging  is  to  move  the  even  numbered  blocks  be,  b g,  6jo,  612  down,  to  the  diagonal 
processors  of  the  lower  (southern)  quadrants,  and  the  odd  numbered  blocks  65,  67,  69,  bn 
up  (north).  The  cost  is  2(^  +  4s  -  l)  communication  steps  for  the  bidirectional  move. 
Then,  in  the  north  quadrants,  all  eastern  data  is  moved  west  to  the  diagonal  blocks,  and  in 
the  south  quadrants,  all  western  data  is  moved  east.  The  cost  is  ^  +8s  —  1  communication 
steps.  Q.E.D. 

Proof  of  Theorem  4:  Sum  the  expressions  of  Lemmas  5  and  6  for  t  =  1  to  log2  k,  where 
each  arithmetic  step  requires  (^-)  =  d? / 4  multiplications  and  each  communication  step 
requires  one  startup  plus  sending  of  ^  =  d/2  words.  Q.E.D. 
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